In this practical, we will focus on multiple data sets and some popular classification methods:
- Default: KNN classification and logistic regression
- Titanic disaster: Support Vector Machines

You can use the Table of Contents to the right to quickly navigate through this document.
One of the packages we will use is class. For this, you will probably need to run install.packages("class") before calling library(). ISLR is also a new package that needs to be installed to access the Default data; the same may hold for other packages listed below. I’d like to specifically mention the caret package, which brings together state-of-the-art statistical learning applications in R. It is a wonderful package for running, comparing and evaluating machine learning models.
library(MASS)
library(tidyverse)
library(magrittr)
library(class)
library(ISLR)
library(caret)
library(kernlab)
library(MLeval)
library(pROC)
Before starting with the exercises, it is a good idea to set your seed, so that (1) your answers are reproducible and (2) you can compare your answers with the answers provided.
set.seed(45)
The Default data set (from the ISLR package) contains credit card data for 10,000 people. The goal is to classify cases as Yes or No based on whether they will default on their credit card loan.
A1. Create a scatterplot of the Default dataset, where balance is mapped to the x position, income is mapped to the y position, and default is mapped to the colour. Can you see any interesting patterns already?
ISLR::Default %>%
arrange(default) %>% # so the yellow dots are plotted after the blue ones
ggplot(aes(x = balance, y = income, colour = default)) +
geom_point(size = 1.3) +
theme_minimal()
It is clear that people with a larger remaining balance are more likely to default. When you look carefully, you may be able to identify two clusters in the data: one for lower incomes and one for higher incomes. The probability of default follows the same pattern in both clusters: a higher remaining balance means a higher probability of default.
A2. Add the line + facet_grid(cols = vars(student)) to the plot. What do you see?
Default %>%
arrange(default) %>% # so the yellow dots are plotted after the blue ones
ggplot(aes(x = balance, y = income, colour = default)) +
geom_point(size = 1.3) +
theme_minimal() +
facet_grid(cols = vars(student))
Clearly, when we facet the plot over students and non-students, we see that the lower-income cluster consists largely of students.
A3. Transform “student” into a dummy variable using ifelse() (0 = not a student, 1 = student). Then, randomly split the Default dataset into a training set train (80%) and a test set test (20%).
# Create train/test split
trainIndex <- createDataPartition(Default$default, p = .8, list = FALSE)
Default %<>%
mutate(student = ifelse(student == "Yes", 1, 0))
train <- Default[trainIndex, ]
test <- Default[-trainIndex, ]
The above code splits the data into two parts: a training set with 80% of the cases and a test set with the remaining 20%. These parts are stored in the train and test objects, respectively. The goal of this split is to assess the fit of the model without using all cases to train the model on. If we trained the model on the same data we evaluate it on, we would run the risk of overfitting and favouring our evaluations. By holding out a part of the data, we can test whether our model also fits cases it has never seen before.
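The split above uses caret; the same idea can be sketched in base R with sample(), though without the stratification on the outcome that createDataPartition() provides. The toy data below are purely illustrative:

```r
set.seed(45)
n   <- 100                                       # toy data set of 100 cases
toy <- data.frame(x = rnorm(n), y = rbinom(n, 1, 0.3))

idx       <- sample(seq_len(n), size = 0.8 * n)  # 80% of row indices at random
toy_train <- toy[idx, ]                          # training part
toy_test  <- toy[-idx, ]                         # held-out test part

nrow(toy_train)  # 80
nrow(toy_test)   # 20
```

Note that createDataPartition() additionally preserves the class proportions of the outcome in both parts, which matters for rare outcomes such as defaulting.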
Now that we have explored the dataset, we can start on the task of classification. We can imagine a credit card company wanting to predict whether a customer will default on the loan so they can take steps to prevent this from happening.
The first method we will be using is k-nearest neighbours (KNN). It classifies a data point based on a majority vote of the k training points closest to it. In R, the class package contains a knn() function to perform KNN.
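The voting idea can be sketched in a few lines of base R. This toy one-dimensional version only illustrates the principle; it is not the implementation used by class::knn():

```r
# Toy 1-D training data: feature x with class labels
x  <- c(1.0, 1.2, 1.4, 5.0, 5.2, 5.4)
cl <- c("No", "No", "No", "Yes", "Yes", "Yes")

knn_one <- function(x_new, x, cl, k) {
  d <- abs(x - x_new)              # distances to all training points
  nearest <- order(d)[1:k]         # indices of the k closest points
  votes <- table(cl[nearest])      # count the class labels among them
  names(votes)[which.max(votes)]   # majority vote
}

knn_one(1.1, x, cl, k = 3)  # "No"  -- nearest neighbours are in the low cluster
knn_one(5.1, x, cl, k = 3)  # "Yes" -- nearest neighbours are in the high cluster
```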
A4. Create class predictions for the test set using the knn() function. Use student, balance, and income (but no basis functions of those variables) in the train dataset. Set k to 5. Store the predictions in a variable called knn_5_pred.
knn_5_pred <- knn(
train = train %>% select(-default),
test = test %>% select(-default),
cl = as_factor(train$default),
k = 5
)
The knn_5_pred object contains the predictions for the default column of the test data. These predictions are obtained by applying the knn() function to the train set. To be able to generate predictions, the knn() function has been given the true classes (cl) from the training data set. It has not seen the true classes for the test set.
A5. Create two scatter plots with income and balance as in the first plot you made. One with the true class (default) mapped to the colour aesthetic, and one with the predicted class (knn_5_pred) mapped to the colour aesthetic.
Hint: Add the predicted class knn_5_pred to the test dataset before starting your ggplot() call of the second plot. What do you see?
# first plot is the same as before
test %>%
arrange(default) %>%
ggplot(aes(x = balance, y = income, colour = default)) +
geom_point(size = 1.3) +
theme_minimal() +
labs(title = "True class")
# second plot maps pred to colour
bind_cols(test, default_pred = knn_5_pred) %>%
arrange(default) %>%
ggplot(aes(x = balance, y = income, colour = default_pred)) +
geom_point(size = 1.3) +
theme_minimal() +
labs(title = "Predicted class (5nn)")
From these plots it is clear that there are quite a few misclassifications: many No predictions for true Yes values, and vice versa.
A6. Repeat the same steps, but now with a knn_2_pred vector generated from a 2-nearest neighbours algorithm. Are there any differences?
knn_2_pred <- knn(
train = train %>% select(-default),
test = test %>% select(-default),
cl = train$default,
k = 2
)
# repeat the second plot from the previous exercise on the new knn predictions
bind_cols(test, default_pred = knn_2_pred) %>%
arrange(default) %>%
ggplot(aes(x = balance, y = income, colour = default_pred)) +
geom_point(size = 1.3) +
theme_minimal() +
labs(title = "Predicted class (2nn)")
Compared to the KNN (K=5) model, more people get classified as Yes. Still, the KNN (K=2) model is far from perfect.
The confusion matrix is an insightful summary of the plots we have made and the correct and incorrect classifications therein. A confusion matrix can be made in R with the confusionMatrix() function from the caret package.
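At its core, a confusion matrix is simply a cross-table of predictions against observed classes; confusionMatrix() adds the derived statistics (accuracy, sensitivity, and so on). A base-R sketch with toy labels:

```r
# Toy observed and predicted class labels
observed  <- factor(c("No", "No", "Yes", "Yes", "No", "Yes"))
predicted <- factor(c("No", "Yes", "Yes", "No", "No", "Yes"))

# The cross-table: rows are predictions, columns are the reference classes
cm <- table(Prediction = predicted, Reference = observed)
cm

# Accuracy is the proportion of cases on the diagonal (correct predictions)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 4 out of 6 correct
```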
confusionMatrix(knn_2_pred, test$default)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1885 46
## Yes 48 20
##
## Accuracy : 0.953
## 95% CI : (0.9428, 0.9618)
## No Information Rate : 0.967
## P-Value [Acc > NIR] : 0.9996
##
## Kappa : 0.2742
##
## Mcnemar's Test P-Value : 0.9179
##
## Sensitivity : 0.9752
## Specificity : 0.3030
## Pos Pred Value : 0.9762
## Neg Pred Value : 0.2941
## Prevalence : 0.9670
## Detection Rate : 0.9430
## Detection Prevalence : 0.9660
## Balanced Accuracy : 0.6391
##
## 'Positive' Class : No
##
The confusion matrix is an insightful means of studying the performance of a classification model. By looking at the crosstable of predictions against observations, we can study the rates at which our fitted model produces correct and false predictions. In this case, our model yields 95.3% correct predictions on the credit card default data. However, the baseline accuracy (prevalence) of this data set is 96.7%. In other words, if we had simply predicted No for every case, we would have obtained a higher accuracy. This places the accuracy of this model in a different light. As noted in the lecture, do not fixate on accuracy: it is not a good measure of performance on imbalanced data.
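We can verify this reasoning directly from the cell counts in the matrix above:

```r
# Cell counts from the 2-nn confusion matrix above
TN <- 1885; FN <- 46   # predicted No
FP <- 48;   TP <- 20   # predicted Yes
n  <- TN + FN + FP + TP

accuracy <- (TN + TP) / n   # proportion of correct predictions
nir      <- (TN + FP) / n   # no-information rate: proportion of true "No" cases

round(accuracy, 3)  # 0.953
round(nir, 3)       # 0.967 -- predicting "No" for everyone would do better
```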
A7. What would this confusion matrix look like if the classification were perfect?
If the classification were perfect, the confusion matrix would be:
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1933 0
## Yes 0 66
##
## Accuracy : 1
## 95% CI : (0.9982, 1)
## No Information Rate : 0.967
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.967
## Detection Rate : 0.967
## Detection Prevalence : 0.967
## Balanced Accuracy : 1.000
##
## 'Positive' Class : No
##
A8. Make a confusion matrix for the 5-nn model and compare it to that of the 2-nn model. What do you conclude?
confusionMatrix(knn_5_pred, test$default)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1919 52
## Yes 14 14
##
## Accuracy : 0.967
## 95% CI : (0.9582, 0.9744)
## No Information Rate : 0.967
## P-Value [Acc > NIR] : 0.5327
##
## Kappa : 0.2838
##
## Mcnemar's Test P-Value : 5.254e-06
##
## Sensitivity : 0.9928
## Specificity : 0.2121
## Pos Pred Value : 0.9736
## Neg Pred Value : 0.5000
## Prevalence : 0.9670
## Detection Rate : 0.9600
## Detection Prevalence : 0.9860
## Balanced Accuracy : 0.6024
##
## 'Positive' Class : No
##
The KNN (K=2) model has more true positives (Yes-Yes) but also more false positives (truly No in the Reference but predicted Yes). Overall the KNN (K=5) model has slightly better accuracy (proportion of correct classifications). However, although this accuracy is higher than randomly assigning Yes or No to cases, the performance of the model is identical to simply classifying every case as No.
KNN directly predicts the class of a new observation using a majority vote of the existing observations closest to it. In contrast to this, logistic regression predicts the log-odds of belonging to category 1. These log-odds can then be transformed to probabilities by performing an inverse logit transform:
\[ p = \frac{1}{1+e^{-\alpha}}, \] where \(\alpha\) denotes the log-odds of being in class 1 and \(p\) is the probability.
Therefore, logistic regression is a probabilistic classifier as opposed to a direct classifier such as KNN: it outputs a probability, which can then be used in conjunction with a cutoff (usually 0.5) to classify new observations.
Logistic regression in R is performed with the glm() function, which stands for generalized linear model. Here we have to indicate that the outcome is modeled not with a Gaussian (normal) distribution, but with a binomial distribution.
A9. Use glm() with argument family = binomial to fit a logistic regression model fit to the train data.
fit <- glm(default ~ ., family = binomial, data = train)
Now that we have fitted a model, we can use the predict() method to output the estimated probabilities for each point in the training dataset. By default, predict() outputs the log-odds, but we can transform these back using the inverse logit function from before, or by setting type = "response" within the predict() call.
A10. Visualise the predicted probabilities versus observed class for the training dataset in fit. You can choose for yourself which type of visualisation you would like to make. Write down your interpretations along with your plot.
tibble(observed = train$default,
predicted = predict(fit, type = "response")) %>%
ggplot(aes(y = predicted, x = observed, colour = observed)) +
geom_point(position = position_jitter(width = 0.2), alpha = .3) +
scale_colour_manual(values = c("dark blue", "orange"), guide = "none") +
theme_minimal() +
labs(y = "Predicted probability to default")
This plot shows the predicted probabilities (obtained with predict(fit, type = "response")) for all the points in the training set. We can see that the defaulting (Yes) category has a higher average predicted probability of default than the No category, but there are still data points in the No category with a high predicted probability of defaulting.
One advantage of parametric procedures like logistic regression is that we get parameters (coefficients) we can interpret.
A11. Look at the coefficients of the fit model and interpret the coefficient for balance. What would the probability of default be for a person who is not a student, has an income of 40000, and a balance of 3000 dollars at the end of each month? Is this what you expect based on the plots we’ve made before?
coefs <- coef(fit)
coefs["balance"]
## balance
## 0.005728527
The higher the balance, the higher the log-odds of defaulting. To be more precise: each dollar increase in balance increases the log-odds of defaulting by 0.0057. Let’s study all coefficients.
coefs
## (Intercept) student balance income
## -1.063443e+01 -7.718385e-01 5.728527e-03 -1.168747e-06
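Since these coefficients live on the log-odds scale, exponentiating them yields odds ratios, which are often easier to interpret. A small sketch using the fitted balance coefficient from the output above:

```r
b_balance <- 0.005728527   # fitted coefficient for balance (from the output above)

# Odds ratio for a 1-dollar increase in balance: barely above 1
exp(b_balance)
# Odds ratio for a 100-dollar increase: the odds of default multiply by about 1.77
exp(100 * b_balance)
```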
Now, to calculate the predicted log-odds of default for a person who is not a student (0 times the coefficient for student), with an income of 40000 (40000 times the coefficient for income) and a balance of 3000 dollars (3000 times the coefficient for balance), we can compute the log-odds directly:
logodds <- coefs[1] + 0*coefs[2] + 40000*coefs[4] + 3000*coefs[3]
We can then convert the log-odds to a probability by
1 / (1 + exp(-logodds))
## (Intercept)
## 0.9985054
or as
plogis(logodds)
## (Intercept)
## 0.9985054
There is a probability of .999 of defaulting. This is in line with the plots we have seen before: this new data point would be all the way to the right.
In two steps, we will visualise the effect balance has on the predicted default probability.
A12. Create a data frame called balance_df with 3 columns and 500 rows: student always 0, balance ranging from 0 to 3000, and income always the mean income in the train dataset.
balance_df <- tibble(
student = rep(0, 500),
balance = seq(0, 3000, length.out = 500),
income = rep(mean(train$income), 500)
)
A13. Use this dataset as the newdata in a predict() call using fit to output the predicted probabilities for different values of balance. Then create a plot with the balance_df$balance variable mapped to x and the predicted probabilities mapped to y. Is this in line with what you expect?
balance_df$predprob <- predict(fit,
newdata = balance_df,
type = "response")
balance_df %>%
ggplot(aes(x = balance, y = predprob)) +
geom_line(col = "dark blue", size = 1) +
theme_minimal()
In the first plot, the ratio of defaults to non-defaults is about 50/50 just before a balance of 2000 dollars. This curve is exactly what we would expect.
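We can check this crossover point from the coefficients alone: \(p = 0.5\) corresponds to log-odds of 0, so we solve intercept + income coefficient times income + balance coefficient times balance = 0 for balance (student is 0 here). The income value below is an illustrative stand-in for mean(train$income); your exact train mean will differ:

```r
# Coefficients from the fitted model (see the coefs output above)
b0    <- -1.063443e+01   # intercept
b_bal <-  5.728527e-03   # balance
b_inc <- -1.168747e-06   # income

income <- 33500          # illustrative value; use mean(train$income) in practice

# The log-odds are zero (p = 0.5) when balance equals:
balance_50 <- -(b0 + b_inc * income) / b_bal
balance_50  # just under 2000, matching the plot
```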
A14. Create a confusion matrix just as the one for the KNN models by using a cutoff predicted probability of 0.5. Does logistic regression perform better?
pred_prob <- predict(fit, newdata = test, type = "response")
pred_lr <- factor(pred_prob > .5, labels = c("No", "Yes"))
confusionMatrix(pred_lr, test$default)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1923 47
## Yes 10 19
##
## Accuracy : 0.9715
## 95% CI : (0.9632, 0.9783)
## No Information Rate : 0.967
## P-Value [Acc > NIR] : 0.1429
##
## Kappa : 0.3877
##
## Mcnemar's Test P-Value : 1.858e-06
##
## Sensitivity : 0.9948
## Specificity : 0.2879
## Pos Pred Value : 0.9761
## Neg Pred Value : 0.6552
## Prevalence : 0.9670
## Detection Rate : 0.9620
## Detection Prevalence : 0.9855
## Balanced Accuracy : 0.6414
##
## 'Positive' Class : No
##
Logistic regression performs better than KNN in every way, at least for our model on this data. Remember that we started with a random seed. Every procedure that uses random numbers thereafter has become seed-dependent. This also holds for the train/test split that we made. A different random split can therefore yield different results. Cross-validation, which I excluded from this practical, can give you an indication of the variance of this difference.
Let’s take the titanic data that we used before and fit the following four models on a training version (70% of cases) of that data set.
Finally, compare the performance of all 4 techniques on the test version (30% of not yet used cases) of that data set.
We can use the following code block to directly load the data in our workspace:
con <- url("https://www.gerkovink.com/datasets/titanic.csv")
titanic <- read_csv(con)
## Rows: 887 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Sex
## dbl (6): Survived, Pclass, Age, Siblings/Spouses Aboard, Parents/Children Ab...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We need to take care of some columns that are not well-coded. Let’s set all measurement levels to what they are supposed to be: factors become factors, ordered factors become ordered factors, and so on.
titanic %<>%
mutate(Pclass = factor(Pclass,
ordered = TRUE,
labels = c("1st class", "2nd class", "3rd class")),
Survived = factor(Survived,
labels = c("Died", "Survived")))
str(titanic)
## spec_tbl_df [887 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Survived : Factor w/ 2 levels "Died","Survived": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Ord.factor w/ 3 levels "1st class"<"2nd class"<..: 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr [1:887] "Mr. Owen Harris Braund" "Mrs. John Bradley (Florence Briggs Thayer) Cumings" "Miss. Laina Heikkinen" "Mrs. Jacques Heath (Lily May Peel) Futrelle" ...
## $ Sex : chr [1:887] "male" "female" "female" "female" ...
## $ Age : num [1:887] 22 38 26 35 35 27 54 2 27 14 ...
## $ Siblings/Spouses Aboard: num [1:887] 1 1 0 1 0 0 0 3 0 1 ...
## $ Parents/Children Aboard: num [1:887] 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num [1:887] 7.25 71.28 7.92 53.1 8.05 ...
## - attr(*, "spec")=
## .. cols(
## .. Survived = col_double(),
## .. Pclass = col_double(),
## .. Name = col_character(),
## .. Sex = col_character(),
## .. Age = col_double(),
## .. `Siblings/Spouses Aboard` = col_double(),
## .. `Parents/Children Aboard` = col_double(),
## .. Fare = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The %<>% pipe returns the result of the pipe to the object.
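A minimal illustration of the difference between %>% and %<>% (requires magrittr, which we loaded earlier):

```r
library(magrittr)

x <- c(1, 4, 9)
x %>% sqrt    # prints the square roots; x itself is unchanged
x %<>% sqrt   # shorthand for: x <- x %>% sqrt
x             # now 1 2 3
```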
Let’s split the titanic data into a training and validation set. Before we do so, we fix the random number generator seed in order to allow for reproduction of our results. Any seed value will do. My favorite seed is 123.
set.seed(123)
Now we can split the data into a test and a training part.
idx <- createDataPartition(titanic$Survived, p = .7, list = FALSE)
train <- titanic[idx, ]
test <- titanic[-idx, ]
We now go through the four models where we predict Survived from the other features in titanic - with the exception of Name, naturally. If we used Name, we would fit a zero-residual model: i.e. a model for every row separately.
For ease of coding we exclude the Name column from the titanic set.
train %<>% select(-Name)
Again, we use the %<>% pipe because it returns the result of the pipe to the object.
Let’s fit the logistic regression model
lm.train <- glm(Survived ~ .,
data = train,
family = binomial(link = "logit"))
And generate the predicted values
lm.pred <- predict(lm.train,
newdata = test %>% select(-Name),
type = "response")
To inspect the performance of the final (and only) model:
confusionMatrix(ifelse(lm.pred < .5, "Died", "Survived") %>% factor,
test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 141 35
## Survived 22 67
##
## Accuracy : 0.7849
## 95% CI : (0.7305, 0.8328)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 2.519e-09
##
## Kappa : 0.5346
##
## Mcnemar's Test P-Value : 0.112
##
## Sensitivity : 0.8650
## Specificity : 0.6569
## Pos Pred Value : 0.8011
## Neg Pred Value : 0.7528
## Prevalence : 0.6151
## Detection Rate : 0.5321
## Detection Prevalence : 0.6642
## Balanced Accuracy : 0.7609
##
## 'Positive' Class : Died
##
Let’s train the linear kernel support vector machine
train_control <- trainControl(method="repeatedcv", number=10, repeats=3,
savePredictions = TRUE,
classProbs = TRUE,
verboseIter = FALSE)
linearSVM <- train(Survived ~.,
data = train,
method = "svmLinear",
trControl = train_control,
preProcess = c("center","scale"),
tuneGrid = expand.grid(C = seq(0.1, 10, by = .5)))
When we inspect the object we see that the optimal value for \(C\) has been trained to be 0.1
Let’s inspect the tuning parameters and the cross-validated performance on the training set.
plot(linearSVM)
Let’s also inspect the ROC curve on the cross-validated data:
plots <- evalm(linearSVM, showplots = FALSE, silent = TRUE)
plots$roc
plots$stdres
## $`Group 1`
## Score CI
## SENS 0.708 0.65-0.76
## SPEC 0.843 0.8-0.88
## MCC 0.556 <NA>
## Informedness 0.551 <NA>
## PREC 0.739 0.68-0.79
## NPV 0.821 0.78-0.86
## FPR 0.157 <NA>
## F1 0.723 <NA>
## TP 170.000 <NA>
## FP 60.000 <NA>
## TN 322.000 <NA>
## FN 70.000 <NA>
## AUC-ROC 0.720 0.68-0.76
## AUC-PR 0.580 <NA>
## AUC-PRG 0.240 <NA>
The Receiver Operating Characteristic (ROC) curve shows the trade-off between sensitivity, or true positive rate (TPR), and specificity (1 minus the false positive rate, FPR). Classifiers that give curves closer to the top-left corner perform better. A random classifier is expected to yield a curve along the diagonal (where TPR = FPR). The closer a curve lies to this 45-degree diagonal of the ROC space, the less accurate the classifier.
The ROC curve does not depend on the class distribution, which makes it very useful for evaluating classifiers that aim to predict rare events, such as diseases or disasters, where the class balance is heavily skewed. Accuracy would then favor classifiers that always predict a negative outcome.
We can use the area under the ROC curve (AUC) to compare different predictive classifiers. The AUC on the cross-validated trained model is .72.
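The mechanics behind an ROC curve can be sketched by hand: sweep a cutoff over the predicted probabilities and record the TPR and FPR at each cutoff. A toy illustration, independent of pROC:

```r
# Toy predicted probabilities and true labels (1 = event)
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
truth <- c(1,   1,   0,   1,   0,   1,   0,   0)

# For each cutoff, classify and compute the true and false positive rates
roc_points <- t(sapply(seq(0, 1, by = 0.1), function(th) {
  pred <- as.integer(prob >= th)
  c(threshold = th,
    TPR = sum(pred == 1 & truth == 1) / sum(truth == 1),
    FPR = sum(pred == 1 & truth == 0) / sum(truth == 0))
}))
roc_points  # as the cutoff drops, both TPR and FPR rise; plotting FPR vs TPR traces the curve
```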
pred.probs <- predict(linearSVM,
newdata = test %>% select(-Name),
type = "prob")
ROC <- roc(predictor = pred.probs$Survived,
response = test$Survived,
plot = TRUE)
## Setting levels: control = Died, case = Survived
## Setting direction: controls < cases
ROC$auc
## Area under the curve: 0.8236
plot(ROC)
Let’s generate the predicted values
linearSVM.pred <- predict(linearSVM,
newdata = test %>% select(-Name),
type = "raw")
To inspect the performance of the final model on the test set:
confusionMatrix(linearSVM.pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 142 39
## Survived 21 63
##
## Accuracy : 0.7736
## 95% CI : (0.7184, 0.8225)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 2.807e-08
##
## Kappa : 0.5055
##
## Mcnemar's Test P-Value : 0.02819
##
## Sensitivity : 0.8712
## Specificity : 0.6176
## Pos Pred Value : 0.7845
## Neg Pred Value : 0.7500
## Prevalence : 0.6151
## Detection Rate : 0.5358
## Detection Prevalence : 0.6830
## Balanced Accuracy : 0.7444
##
## 'Positive' Class : Died
##
Let’s train the polynomial kernel support vector machine
polySVM <- train(Survived ~.,
data = train,
method = "svmPoly",
trControl = train_control,
preProcess = c("center","scale"),
tuneGrid = expand.grid(C = seq(0.25, 2, by = .25),
scale = seq(0.1, .3, by = .1),
degree = c(1:4)))
Let’s inspect the tuning parameters and the cross-validated performance on the training set.
plot(polySVM)
polySVM
## Support Vector Machines with Polynomial Kernel
##
## 622 samples
## 6 predictor
## 2 classes: 'Died', 'Survived'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 560, 560, 560, 559, 560, 560, ...
## Resampling results across tuning parameters:
##
## C scale degree Accuracy Kappa
## 0.25 0.1 1 0.7910309 0.5557889
## 0.25 0.1 2 0.8204728 0.6153980
## 0.25 0.1 3 0.8312255 0.6389489
## 0.25 0.1 4 0.8263868 0.6288898
## 0.25 0.2 1 0.7910309 0.5557889
## 0.25 0.2 2 0.8215822 0.6190424
## 0.25 0.2 3 0.8258577 0.6272477
## 0.25 0.2 4 0.8280253 0.6330356
## 0.25 0.3 1 0.7910309 0.5557889
## 0.25 0.3 2 0.8237071 0.6226657
## 0.25 0.3 3 0.8205325 0.6158598
## 0.25 0.3 4 0.8108124 0.5864759
## 0.50 0.1 1 0.7910309 0.5557889
## 0.50 0.1 2 0.8237157 0.6230856
## 0.50 0.1 3 0.8301417 0.6364636
## 0.50 0.1 4 0.8263868 0.6286117
## 0.50 0.2 1 0.7910309 0.5557889
## 0.50 0.2 2 0.8231695 0.6213465
## 0.50 0.2 3 0.8232122 0.6224694
## 0.50 0.2 4 0.8231951 0.6211647
## 0.50 0.3 1 0.7910309 0.5557889
## 0.50 0.3 2 0.8215822 0.6183047
## 0.50 0.3 3 0.8173323 0.6088670
## 0.50 0.3 4 0.7872162 0.5268853
## 0.75 0.1 1 0.7910309 0.5557889
## 0.75 0.1 2 0.8242704 0.6244605
## 0.75 0.1 3 0.8290835 0.6339438
## 0.75 0.1 4 0.8280082 0.6323988
## 0.75 0.2 1 0.7910309 0.5557889
## 0.75 0.2 2 0.8237071 0.6221929
## 0.75 0.2 3 0.8199949 0.6146466
## 0.75 0.2 4 0.8210445 0.6163681
## 0.75 0.3 1 0.7905018 0.5545475
## 0.75 0.3 2 0.8263697 0.6273625
## 0.75 0.3 3 0.8130312 0.5992729
## 0.75 0.3 4 0.8001109 0.5592056
## 1.00 0.1 1 0.7910309 0.5557889
## 1.00 0.1 2 0.8210360 0.6178495
## 1.00 0.1 3 0.8290835 0.6341977
## 1.00 0.1 4 0.8237413 0.6237084
## 1.00 0.2 1 0.7910309 0.5557889
## 1.00 0.2 2 0.8237071 0.6220430
## 1.00 0.2 3 0.8221454 0.6197063
## 1.00 0.2 4 0.8204984 0.6141388
## 1.00 0.3 1 0.7905018 0.5545475
## 1.00 0.3 2 0.8242277 0.6222235
## 1.00 0.3 3 0.8114183 0.5949912
## 1.00 0.3 4 0.7732719 0.4820830
## 1.25 0.1 1 0.7910309 0.5557889
## 1.25 0.1 2 0.8242448 0.6242691
## 1.25 0.1 3 0.8280082 0.6319069
## 1.25 0.1 4 0.8200034 0.6150001
## 1.25 0.2 1 0.7905018 0.5545475
## 1.25 0.2 2 0.8253030 0.6251883
## 1.25 0.2 3 0.8200119 0.6148727
## 1.25 0.2 4 0.8151220 0.5990568
## 1.25 0.3 1 0.7905018 0.5545475
## 1.25 0.3 2 0.8204813 0.6149046
## 1.25 0.3 3 0.8071002 0.5834182
## 1.25 0.3 4 0.7530125 0.4251175
## 1.50 0.1 1 0.7910309 0.5557889
## 1.50 0.1 2 0.8231866 0.6218777
## 1.50 0.1 3 0.8274706 0.6306296
## 1.50 0.1 4 0.8226745 0.6209851
## 1.50 0.2 1 0.7905018 0.5545475
## 1.50 0.2 2 0.8252944 0.6246708
## 1.50 0.2 3 0.8178614 0.6100472
## 1.50 0.2 4 0.8119218 0.5946906
## 1.50 0.3 1 0.7905018 0.5545475
## 1.50 0.3 2 0.8215480 0.6177644
## 1.50 0.3 3 0.8156938 0.6041679
## 1.50 0.3 4 0.7518177 0.4223787
## 1.75 0.1 1 0.7910309 0.5557889
## 1.75 0.1 2 0.8204984 0.6158411
## 1.75 0.1 3 0.8253115 0.6261000
## 1.75 0.1 4 0.8248165 0.6256193
## 1.75 0.2 1 0.7905018 0.5545475
## 1.75 0.2 2 0.8231609 0.6200506
## 1.75 0.2 3 0.8189452 0.6130803
## 1.75 0.2 4 0.7984810 0.5590285
## 1.75 0.3 1 0.7905018 0.5545475
## 1.75 0.3 2 0.8204813 0.6151917
## 1.75 0.3 3 0.8108722 0.5930300
## 1.75 0.3 4 0.7497440 0.4247249
## 2.00 0.1 1 0.7910309 0.5557889
## 2.00 0.1 2 0.8226318 0.6202868
## 2.00 0.1 3 0.8210360 0.6169541
## 2.00 0.1 4 0.8221454 0.6194494
## 2.00 0.2 1 0.7905018 0.5545475
## 2.00 0.2 2 0.8242277 0.6222787
## 2.00 0.2 3 0.8146527 0.6036025
## 2.00 0.2 4 0.8060079 0.5765475
## 2.00 0.3 1 0.7905018 0.5545475
## 2.00 0.3 2 0.8204813 0.6154468
## 2.00 0.3 3 0.8129886 0.5980960
## 2.00 0.3 4 0.7610514 0.4509216
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 0.25.
Inspect the ROC curve of the predictions
pred.probs <- predict(polySVM,
newdata = test %>% select(-Name),
type = "prob")
ROC <- roc(predictor = pred.probs$Survived,
response = test$Survived,
plot = TRUE)
## Setting levels: control = Died, case = Survived
## Setting direction: controls < cases
ROC$auc
## Area under the curve: 0.817
plot(ROC)
Now we generate the predicted values
polySVM.pred <- predict(polySVM,
newdata = test %>% select(-Name),
type = "raw")
To inspect the performance of the final model on the test set:
confusionMatrix(polySVM.pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 149 40
## Survived 14 62
##
## Accuracy : 0.7962
## 95% CI : (0.7426, 0.8431)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 1.858e-10
##
## Kappa : 0.5481
##
## Mcnemar's Test P-Value : 0.0006688
##
## Sensitivity : 0.9141
## Specificity : 0.6078
## Pos Pred Value : 0.7884
## Neg Pred Value : 0.8158
## Prevalence : 0.6151
## Detection Rate : 0.5623
## Detection Prevalence : 0.7132
## Balanced Accuracy : 0.7610
##
## 'Positive' Class : Died
##
Let’s train the radial kernel support vector machine
radialSVM <- train(Survived~.,
data = train,
method = "svmRadial",
trControl = train_control,
preProcess = c("center","scale"),
tuneLength = 10)
Instead of specifying a grid, we can also ask caret to use a tuneLength of 10. It will then cycle over a hyperparameter grid of that length. For the linear SVM kernel there was only the single tuning parameter \(C\), which is why we specified the grid ourselves there. When we inspect the object below, we see that sigma has been held constant at 0.4301104 and the optimal value for \(C\) has been trained to be 0.5.
Let’s inspect the tuning parameters and the cross-validated performance on the training set.
plot(radialSVM)
radialSVM
## Support Vector Machines with Radial Basis Function Kernel
##
## 622 samples
## 6 predictor
## 2 classes: 'Died', 'Survived'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 560, 560, 560, 560, 559, 560, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8290067 0.6337552
## 0.50 0.8359874 0.6471976
## 1.00 0.8359788 0.6477731
## 2.00 0.8300905 0.6330929
## 4.00 0.8295699 0.6330901
## 8.00 0.8247483 0.6239330
## 16.00 0.8225806 0.6198035
## 32.00 0.8209677 0.6168844
## 64.00 0.8145332 0.6015914
## 128.00 0.8075696 0.5820259
##
## Tuning parameter 'sigma' was held constant at a value of 0.4301104
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.4301104 and C = 0.5.
Let’s inspect the ROC curve on the predictions
pred.probs <- predict(radialSVM,
newdata = test %>% select(-Name),
type = "prob")
ROC <- roc(predictor = pred.probs$Survived,
response = test$Survived,
plot = TRUE)
## Setting levels: control = Died, case = Survived
## Setting direction: controls < cases
ROC$auc
## Area under the curve: 0.812
plot(ROC)
And generate the predicted values
radialSVM.pred <- predict(radialSVM,
newdata = test %>% select(-Name),
type = "raw")
To inspect the performance of the final model on the test set:
confusionMatrix(radialSVM.pred, test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Died Survived
## Died 150 37
## Survived 13 65
##
## Accuracy : 0.8113
## 95% CI : (0.7589, 0.8566)
## No Information Rate : 0.6151
## P-Value [Acc > NIR] : 4.176e-12
##
## Kappa : 0.5832
##
## Mcnemar's Test P-Value : 0.001143
##
## Sensitivity : 0.9202
## Specificity : 0.6373
## Pos Pred Value : 0.8021
## Neg Pred Value : 0.8333
## Prevalence : 0.6151
## Detection Rate : 0.5660
## Detection Prevalence : 0.7057
## Balanced Accuracy : 0.7788
##
## 'Positive' Class : Died
##
In this part, we will learn how to use different ensemble methods in R, recap on how to evaluate the performance of the methods, and learn how we can substantively interpret the model output.
There are some packages we need for this part of the practical that we have not yet loaded:
library(psych)
library(gbm)
library(xgboost)
library(data.table)
library(ggforce)
In this practical we will work with the ILPD (Indian Liver Patient Dataset) from the UCI Machine Learning Repository (you can find the data here). This data set contains data on 414 liver-disease patients and 165 non-patients. In general, medical researchers have two distinct goals when doing research: (1) being able to classify people in their waiting room as either patients or non-patients, and (2) gaining insight into the factors that are associated with the disease. In this part of the practical, we will look at both aspects.
I have prepared the training and test data sets for you. You can load them by running the following code block, which grabs the data from one of my repositories:
con <- url("https://www.gerkovink.com/datasets/train_test.Rdata")
load(con)
We will use these data sets to make inferences and to train a prediction model.
Before we continue, we fix the random number generator seed.
set.seed(123)
C1. Get an impression of the training data by looking at the structure of the data and creating some descriptive statistics.
First we inspect the head() and tail() of the train data
head(train)
## # A tibble: 6 × 11
## Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase
## <dbl> <fct> <dbl> <dbl> <dbl>
## 1 65 Female 0.7 0.1 187
## 2 62 Male 10.9 5.5 699
## 3 62 Male 7.3 4.1 490
## 4 58 Male 1 0.4 182
## 5 72 Male 3.9 2 195
## 6 17 Male 0.9 0.3 202
## # … with 6 more variables: Alamine_Aminotransferase <dbl>,
## # Aspartate_Aminotransferase <dbl>, Total_Protiens <dbl>, Albumin <dbl>,
## # Ratio_Albumin_Globulin <dbl>, Disease <fct>
tail(train)
## # A tibble: 6 × 11
## Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase
## <dbl> <fct> <dbl> <dbl> <dbl>
## 1 32 Male 25 13.7 560
## 2 32 Male 12.7 8.4 190
## 3 60 Male 0.5 0.1 500
## 4 40 Male 0.6 0.1 98
## 5 31 Male 1.3 0.5 184
## 6 38 Male 1 0.3 216
## # … with 6 more variables: Alamine_Aminotransferase <dbl>,
## # Aspartate_Aminotransferase <dbl>, Total_Protiens <dbl>, Albumin <dbl>,
## # Ratio_Albumin_Globulin <dbl>, Disease <fct>
We can also obtain descriptive statistics about this data as follows
train %>%
select(-c(Gender, Disease)) %>%
describeBy(train$Disease, fast = TRUE)
##
## Descriptive statistics by group
## group: Healthy
## vars n mean sd min max range se
## Age 1 332 46.20 15.78 8.0 90.0 82.0 0.87
## Total_Bilirubin 2 332 4.13 6.23 0.4 42.8 42.4 0.34
## Direct_Bilirubin 3 332 2.02 3.27 0.1 19.7 19.6 0.18
## Alkaline_Phosphotase 4 332 313.24 238.70 75.0 1750.0 1675.0 13.10
## Alamine_Aminotransferase 5 332 98.96 217.79 12.0 2000.0 1988.0 11.95
## Aspartate_Aminotransferase 6 332 140.73 356.85 11.0 4929.0 4918.0 19.58
## Total_Protiens 7 332 6.47 1.10 2.8 9.6 6.8 0.06
## Albumin 8 332 3.06 0.78 0.9 5.5 4.6 0.04
## Ratio_Albumin_Globulin 9 332 0.92 0.34 0.3 2.8 2.5 0.02
## ------------------------------------------------------------
## group: Disease
## vars n mean sd min max range se
## Age 1 132 41.15 16.37 4.00 84.00 80.00 1.42
## Total_Bilirubin 2 132 1.20 1.11 0.50 7.30 6.80 0.10
## Direct_Bilirubin 3 132 0.42 0.57 0.10 3.60 3.50 0.05
## Alkaline_Phosphotase 4 132 222.58 154.04 90.00 1580.00 1490.00 13.41
## Alamine_Aminotransferase 5 132 33.92 23.18 10.00 181.00 171.00 2.02
## Aspartate_Aminotransferase 6 132 42.28 38.08 12.00 285.00 273.00 3.31
## Total_Protiens 7 132 6.44 1.04 3.70 9.20 5.50 0.09
## Albumin 8 132 3.29 0.79 1.40 5.00 3.60 0.07
## Ratio_Albumin_Globulin 9 132 1.03 0.29 0.37 1.85 1.48 0.03
It is quite clear that there are substantial differences between the diseased and non-diseased in the data.
C2. To further explore the data for this practical, create some interesting data visualizations that show whether there are interesting patterns in the data.
Hint: Think about adding a color aesthetic for the variable Disease.
I give here a set of visualizations that I think are informative. There are many more visualizations that one could create:
train %>%
select(-Gender) %>%
pivot_longer(where(is.numeric)) %>%
ggplot(aes(x = value, col = Disease, fill = Disease)) +
geom_boxplot(alpha = 0.8) +
facet_wrap(~name, scales = "free") +
scale_color_brewer(palette = "Paired") +
scale_fill_brewer(palette = "Paired") +
theme_minimal()
train %>%
select(-Gender) %>%
pivot_longer(where(is.numeric)) %>%
ggplot(aes(x = value, col = Disease, fill = Disease)) +
geom_density(alpha = 0.8) +
facet_wrap(~name, scales = "free") +
scale_color_brewer(palette = "Paired") +
scale_fill_brewer(palette = "Paired") +
theme_minimal()
prop.table(table(train$Gender, train$Disease), margin = 1) %>%
as.data.frame %>%
select(Gender = Var1, Disease = Var2, `Relative Frequency` = Freq) %>%
ggplot(aes(y = `Relative Frequency`, x = Gender, col = Disease, fill = Disease)) +
geom_col(alpha = 0.8, position = "dodge") +
scale_fill_brewer(palette = "Paired") +
scale_color_brewer(palette = "Paired") +
theme_minimal()
From these visualizations we can see differences between the distributions for the two
Disease categories. However, these differences do not seem to be dramatic. Additionally, there are relatively more women with the liver disease than men.
C3. Shortly reflect on the difference between bagging, random forests, and boosting.
## Bagging: fit a classification tree to each of N bootstrap samples of the
## training data and average the predictions of all trees.
## Note: the out-of-bag observations can serve as an internal validation set.
## Random forest: as in bagging, classification trees are trained on bootstrap
## samples of the data, but each split only considers a random
## subset of the features, which decorrelates the trees.
## Boosting: decision trees are built sequentially; given the current model,
## we fit a (small) tree on the residuals of the current model,
## rather than on the outcome Y itself.
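To make the bagging recipe concrete, here is a minimal hand-rolled sketch on simulated data (assuming the rpart package is available; caret's treebag method automates exactly this loop):

```r
library(rpart)
set.seed(1)
# Simulated two-class data with a simple signal
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(200, sd = 0.5) > 0, "A", "B"))

B <- 25
probs <- sapply(seq_len(B), function(b) {
  boot <- d[sample(nrow(d), replace = TRUE), ]  # bootstrap sample b
  fit  <- rpart(y ~ x1 + x2, data = boot)       # one tree per sample
  predict(fit, newdata = d)[, "A"]              # P(class A) from tree b
})
pred <- ifelse(rowMeans(probs) > 0.5, "A", "B") # average over trees, then vote
mean(pred == d$y)                               # ensemble training accuracy
```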
We are going to apply different machine learning models using the caret package.
C4. Apply bagging to the training data, to predict the outcome Disease, using the caret package.
Note. We first specify the internal validation settings, like so:
cvcontrol <- trainControl(method = "repeatedcv",
number = 10,
allowParallel = TRUE)
These settings can be inserted within the train function from the caret package. Make sure to use the treebag method, to specify cvcontrol as the trControl argument and to set importance = TRUE.
bag_train <- train(Disease ~ .,
data = train,
method = 'treebag',
trControl = cvcontrol,
importance = TRUE)
C5. Interpret the variable importance measure using the varImp function on the trained model object.
bag_train %>%
varImp %>%
plot
C6. Create training set predictions based on the bagged model, and use the confusionMatrix() function from the caret package to assess its performance.
Hint: You will have to create predictions based on the trained model for the training data, and evaluate these against the observed values of the training data.
confusionMatrix(predict(bag_train, type = "raw"),
train$Disease)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 332 1
## Disease 0 131
##
## Accuracy : 0.9978
## 95% CI : (0.9881, 0.9999)
## No Information Rate : 0.7155
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9947
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9924
## Pos Pred Value : 0.9970
## Neg Pred Value : 1.0000
## Prevalence : 0.7155
## Detection Rate : 0.7155
## Detection Prevalence : 0.7177
## Balanced Accuracy : 0.9962
##
## 'Positive' Class : Healthy
##
We have realized near-perfect training set performance. However, this only demonstrates that the model has learned the training data well. We need to evaluate the model on the test set before we can draw conclusions about predictive power and test error.
C7. Now ask for the output of the bagged model. Explain why the performance estimates under both approaches differ.
bag_train
## Bagged CART
##
## 464 samples
## 10 predictor
## 2 classes: 'Healthy', 'Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 417, 418, 417, 418, 418, 418, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7218316 0.2589208
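This optimism of the training error can be reproduced on toy data: an unpruned tree scores near-perfectly on the data it was grown on, but much worse on fresh data from the same process (a sketch using the rpart package):

```r
library(rpart)
set.seed(2)
gen <- function(n) {
  x1 <- rnorm(n); x2 <- rnorm(n)
  data.frame(x1, x2, y = factor(rbinom(n, 1, plogis(2 * x1))))
}
d_train <- gen(300)
d_new   <- gen(300)  # fresh data from the same process
fit <- rpart(y ~ ., data = d_train,
             control = rpart.control(cp = 0, minsplit = 2))  # fully grown tree
acc_train <- mean(predict(fit, d_train, type = "class") == d_train$y)
acc_new   <- mean(predict(fit, d_new,   type = "class") == d_new$y)
c(train = acc_train, new = acc_new)  # training accuracy is (near-)perfect,
                                     # accuracy on new data is much lower
```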
The cross-validated accuracy reported here (about 0.72) is far lower than the near-perfect accuracy from C6. In C6 we evaluated the model on the very data it was trained on, whereas the resampling results are obtained on held-out folds and therefore give a much more realistic estimate of out-of-sample performance.
We will now follow the same approach, but rather than bagging, we will train a random forest on the training data.
C8. Fit a random forest to the training data to predict the outcome Disease, using the caret library.
Note. Use the same cvcontrol settings as in the previous model.
rf_train <- train(Disease ~ .,
data = train,
method = 'rf',
trControl = cvcontrol,
importance = TRUE)
C9. Again, interpret the variable importance measure using the varImp function on the trained model object. Do you draw the same conclusions as under the bagged model?
rf_train %>%
varImp %>%
plot
The random forest model rf_train indicates a different variable importance than the bagged model bag_train. This is due to the random selection of predictors within random forests: the bootstrap-based trees are thus decorrelated.
C10. Output the model output from the random forest. Are we doing better than with the bagged model?
rf_train
## Random Forest
##
## 464 samples
## 10 predictor
## 2 classes: 'Healthy', 'Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 417, 418, 417, 418, 418, 418, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7243004 0.2497997
## 6 0.7030161 0.2179426
## 10 0.7115749 0.2307716
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Yes, the most accurate model indicates that we do just slightly better than with the bagged model. However, this might well be due to chance.
C11. Now, fit a boosting model using the caret library to predict disease status.
Hint: Use gradient boosting (the gbm method in caret).
gbm_train <- train(Disease ~ .,
data = train,
method = "gbm",
verbose = F,
trControl = cvcontrol)
C12. Again, interpret the variable importance measure. You will have to call for summary() on the model object you just created. Compare the output to the previously obtained variable importance measures.
summary(gbm_train)
## var rel.inf
## Alkaline_Phosphotase Alkaline_Phosphotase 17.4710342
## Age Age 16.8212053
## Aspartate_Aminotransferase Aspartate_Aminotransferase 15.6829277
## Alamine_Aminotransferase Alamine_Aminotransferase 13.2912014
## Direct_Bilirubin Direct_Bilirubin 11.0980359
## Total_Protiens Total_Protiens 9.3152709
## Albumin Albumin 6.7373895
## Ratio_Albumin_Globulin Ratio_Albumin_Globulin 6.1805270
## Total_Bilirubin Total_Bilirubin 2.8300197
## GenderFemale GenderFemale 0.5723883
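Note that gbm's relative influence values are scaled percentages: each number is the share of the total loss reduction attributable to splits on that variable, so the column sums to 100. A quick check on the values printed above:

```r
# Relative influences copied from the summary(gbm_train) output above
rel_inf <- c(17.4710342, 16.8212053, 15.6829277, 13.2912014, 11.0980359,
             9.3152709, 6.7373895, 6.1805270, 2.8300197, 0.5723883)
sum(rel_inf)  # the relative influences sum to 100 (up to rounding)
```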
C13. Output the model output from our gradient boosting procedure. Are we doing better than with the bagged and random forest model?
gbm_train
## Stochastic Gradient Boosting
##
## 464 samples
## 10 predictor
## 2 classes: 'Healthy', 'Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 417, 418, 418, 418, 418, 418, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7240981 0.2019393
## 1 100 0.7239593 0.2381550
## 1 150 0.7218779 0.2268189
## 2 50 0.7219704 0.2328491
## 2 100 0.7175763 0.2311722
## 2 150 0.7283071 0.2698521
## 3 50 0.7154024 0.2243496
## 3 100 0.7088344 0.2236268
## 3 150 0.7281684 0.2802560
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 2, shrinkage = 0.1 and n.minobsinnode = 10.
Yes, our best model does slightly better than the previous two models. However, the performance gain is small and might be due to random variation.
xgboost and SHAP
For now, we will continue with extreme gradient boosting, although we will use a different procedure.
We will use xgboost to train a binary classification model, and create some visualizations to obtain additional insight in our model. We will create the visualizations using SHAP (SHapley Additive exPlanations) values, which are a measure of importance of the variables in the model. In fact, SHAP values indicate the influence of each input variable on the predicted probability for each person. Essentially, these give an indication of the difference between the predicted probability with and without that variable, for each person’s score.
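The additive logic behind SHAP values can be illustrated with a linear model, where they have a simple closed form: the contribution of feature j for person i is the coefficient times (x_ij minus the feature mean), and the contributions plus the average prediction recover each person's individual prediction exactly. A small self-contained sketch:

```r
set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 2 * x1 - x2 + rnorm(100, sd = 0.1)
fit <- lm(y ~ x1 + x2)

baseline <- mean(fitted(fit))               # average prediction over all persons
phi1 <- coef(fit)["x1"] * (x1 - mean(x1))   # per-person contribution of x1
phi2 <- coef(fit)["x2"] * (x2 - mean(x2))   # per-person contribution of x2

# baseline + contributions reproduce every individual prediction exactly
all.equal(unname(baseline + phi1 + phi2), unname(fitted(fit)))
```

For tree ensembles such as xgboost there is no such closed form, which is why dedicated algorithms (like the one in shap.R below) are needed to compute the values.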
C14. Download the file shap.R from this Github repository.
Note. There are multiple ways to do this, of which the simplest is to run the following code.
con <- url("https://github.com/pablo14/shap-values/blob/master/shap.R?raw=TRUE")
source(con)
C15. Specify your model as follows, and use it to create predictions on the training data.
train_x <- model.matrix(Disease ~ ., train)[,-1]
train_y <- as.numeric(train$Disease) - 1
xgboost_train <- xgboost(data = train_x,
label = train_y,
max.depth = 10,
eta = 1,
nthread = 4,
nrounds = 4,
objective = "binary:logistic",
verbose = 2)
pred <- tibble(Disease = predict(xgboost_train, newdata = train_x)) %>%
mutate(Disease = factor(ifelse(Disease < 0.5, 1, 2),
labels = c("Healthy", "Disease")))
confusionMatrix(pred$Disease, train$Disease)
C16. First, calculate the SHAP rank scores for all variables in the data, and create a variable importance plot using these values. Interpret the plot.
shap_results <- shap.score.rank(xgboost_train,
X_train = train_x,
shap_approx = F)
## make SHAP score by decreasing order
var_importance(shap_results)
C17. Plot the SHAP values for every individual for every feature and interpret them.
shap_long <- shap.prep(shap = shap_results,
X_train = train_x)
plot.shap.summary(shap_long)
xgb.plot.shap(train_x, features = colnames(train_x), model = xgboost_train, n_col = 3)
The first plot demonstrates that those with a high value for Direct_Bilirubin have a lower probability of being diseased. Also, those with a higher age have a lower probability of being diseased, while those with a higher Albumin have a higher probability of being diseased.
The second set of plots displays the marginal relationships of the SHAP values with the predictors. This conveys the same information, but in greater detail. The interpretability may be a bit tricky for the inexperienced data analyst.
C18. Verify which of the models you created in this practical performs best on the test data.
bag_test <- predict(bag_train, newdata = test)
rf_test <- predict(rf_train, newdata = test)
gbm_test <- predict(gbm_train, newdata = test)
xgb_test <- predict(xgboost_train, newdata = model.matrix(Disease ~ ., test)[,-1]) %>%
{factor(ifelse(. < 0.5, 1, 2), levels = c(1, 2), labels = c("Healthy", "Disease"))}
list(`bagging` = bag_test,
`random_forest` = rf_test,
`gradient_boosting` = gbm_test,
`xtreme_gradient_boosting` = xgb_test) %>%
map(~ confusionMatrix(.x, test$Disease))
## $bagging
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 64 22
## Disease 18 11
##
## Accuracy : 0.6522
## 95% CI : (0.5577, 0.7386)
## No Information Rate : 0.713
## P-Value [Acc > NIR] : 0.9368
##
## Kappa : 0.1181
##
## Mcnemar's Test P-Value : 0.6353
##
## Sensitivity : 0.7805
## Specificity : 0.3333
## Pos Pred Value : 0.7442
## Neg Pred Value : 0.3793
## Prevalence : 0.7130
## Detection Rate : 0.5565
## Detection Prevalence : 0.7478
## Balanced Accuracy : 0.5569
##
## 'Positive' Class : Healthy
##
##
## $random_forest
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 69 25
## Disease 13 8
##
## Accuracy : 0.6696
## 95% CI : (0.5757, 0.7544)
## No Information Rate : 0.713
## P-Value [Acc > NIR] : 0.87084
##
## Kappa : 0.0941
##
## Mcnemar's Test P-Value : 0.07435
##
## Sensitivity : 0.8415
## Specificity : 0.2424
## Pos Pred Value : 0.7340
## Neg Pred Value : 0.3810
## Prevalence : 0.7130
## Detection Rate : 0.6000
## Detection Prevalence : 0.8174
## Balanced Accuracy : 0.5419
##
## 'Positive' Class : Healthy
##
##
## $gradient_boosting
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 73 22
## Disease 9 11
##
## Accuracy : 0.7304
## 95% CI : (0.6397, 0.8089)
## No Information Rate : 0.713
## P-Value [Acc > NIR] : 0.38375
##
## Kappa : 0.2534
##
## Mcnemar's Test P-Value : 0.03114
##
## Sensitivity : 0.8902
## Specificity : 0.3333
## Pos Pred Value : 0.7684
## Neg Pred Value : 0.5500
## Prevalence : 0.7130
## Detection Rate : 0.6348
## Detection Prevalence : 0.8261
## Balanced Accuracy : 0.6118
##
## 'Positive' Class : Healthy
##
##
## $xtreme_gradient_boosting
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Disease
## Healthy 64 23
## Disease 18 10
##
## Accuracy : 0.6435
## 95% CI : (0.5488, 0.7306)
## No Information Rate : 0.713
## P-Value [Acc > NIR] : 0.9579
##
## Kappa : 0.0875
##
## Mcnemar's Test P-Value : 0.5322
##
## Sensitivity : 0.7805
## Specificity : 0.3030
## Pos Pred Value : 0.7356
## Neg Pred Value : 0.3571
## Prevalence : 0.7130
## Detection Rate : 0.5565
## Detection Prevalence : 0.7565
## Balanced Accuracy : 0.5418
##
## 'Positive' Class : Healthy
##
The best performing model in this case on this data split is the gradient boosting model from exercise C11.
C19. Verify if the model is calibrated. To do so, we take the predicted values from the model in C11 and bind them together with the observed class for the test set.
out <- data.frame(obs = test$Disease,
gbm = predict(gbm_train, newdata = test, type = "prob"),
rf = predict(rf_train, newdata = test, type = "prob"),
bag = predict(bag_train, newdata = test, type = "prob"),
xgb = 1 - predict(xgboost_train, newdata = model.matrix(Disease ~ ., test)[,-1]))
out %>% head()
## obs gbm.Healthy gbm.Disease rf.Healthy rf.Disease bag.Healthy bag.Disease
## 1 Healthy 0.7291467 0.27085329 0.586 0.414 0.48 0.52
## 2 Healthy 0.6369474 0.36305262 0.510 0.490 0.40 0.60
## 3 Healthy 0.5295399 0.47046008 0.462 0.538 0.48 0.52
## 4 Healthy 0.8909366 0.10906345 0.784 0.216 0.88 0.12
## 5 Healthy 0.7790324 0.22096756 0.812 0.188 0.72 0.28
## 6 Healthy 0.9866164 0.01338356 0.908 0.092 0.92 0.08
## xgb
## 1 0.7046135
## 2 0.6667330
## 3 0.9306778
## 4 0.8322709
## 5 0.6267336
## 6 0.9783914
The caret package has a calibration() function to obtain the calibration data, which we can then plot:
calibration(obs ~ gbm.Healthy + rf.Healthy + bag.Healthy + xgb, data = out) %>%
xyplot( auto.key = list(columns = 2))
We can see that none of the models yields well-calibrated probabilities.
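What the calibration plot computes can be mimicked by hand: bin the predicted probabilities and compare each bin's observed event rate with the bin midpoint. A toy sketch with simulated, perfectly calibrated predictions:

```r
set.seed(1)
p_hat <- runif(10000)                          # predicted P(event)
obs   <- rbinom(10000, 1, p_hat)               # outcomes drawn with those probabilities
bins  <- cut(p_hat, breaks = seq(0, 1, 0.1), include.lowest = TRUE)
obs_rate <- tapply(obs, bins, mean)            # observed event rate per bin
round(obs_rate, 2)  # tracks the bin midpoints 0.05, 0.15, ..., 0.95
```

For a miscalibrated model, such as the ones above, the observed rates drift away from the diagonal.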
End of all exercises